NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Privacy-Preserving Range Aggregation Queries Using a Learning-Based Approach

https://doi.org/10.1109/PerComWorkshops65533.2025.00102

Guan, Hong; Zou, Jia (March 2025, IEEE)

Free, publicly-accessible full text available March 17, 2026
Privacy and Accuracy-Aware AI/ML Model Deduplication

https://doi.org/10.1145/3725340

Guan, Hong; Yu, Lei; Zhou, Lixi; Xiong, Li; Chowdhury, Kanchan; Xie, Lulu; Xiao, Xusheng; Zou, Jia (June 2025, Proceedings of the ACM on Management of Data)

With the growing adoption of privacy-preserving machine learning algorithms, such as Differentially Private Stochastic Gradient Descent (DP-SGD), training or fine-tuning models on private datasets has become increasingly prevalent. This shift has led to the need for models offering varying privacy guarantees and utility levels to satisfy diverse user requirements. Managing numerous versions of large models introduces significant operational challenges, including increased inference latency, higher resource consumption, and elevated costs. Model deduplication is a technique widely used by many model serving and database systems to support high-performance and low-cost inference queries and model diagnosis queries. However, no existing model deduplication work has considered privacy, which leads to unbounded aggregation of privacy costs for certain deduplicated models and to inefficiencies when deduplicating DP-trained models. We formalize the problem of deduplicating DP-trained models for the first time and propose a novel privacy- and accuracy-aware deduplication mechanism to address the problem. We developed a greedy strategy to select and assign base models to target models to minimize storage and privacy costs. When deduplicating a target model, we dynamically schedule accuracy validations and apply the Sparse Vector Technique to reduce the privacy costs associated with private validation data. Compared to baselines, our approach improved the compression ratio by up to 35× for individual models (including large language models and vision transformers). We also observed up to 43× inference speedup due to the reduction of I/O operations.
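The abstract names the Sparse Vector Technique without saying how it is wired into the validation schedule. As a rough illustration only, the sketch below shows the textbook AboveThreshold variant of that technique: a threshold is perturbed once, each query answer is perturbed independently, and the procedure stops at the first noisy answer that clears the noisy threshold. The function names, the `sensitivity` parameter, and the noise scales (2Δ/ε for the threshold, 4Δ/ε per query) follow the standard formulation and are assumptions here, not the paper's mechanism.

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def above_threshold(queries, threshold, epsilon, sensitivity=1.0, rng=None):
    """Return the index of the first query whose noisy answer reaches the
    noisy threshold, or None if no query does (textbook AboveThreshold)."""
    rng = rng or random.Random()
    # The threshold is perturbed once, before any query is examined.
    noisy_threshold = threshold + laplace(2.0 * sensitivity / epsilon, rng)
    for index, answer in enumerate(queries):
        # Each query answer gets fresh, independent noise.
        if answer + laplace(4.0 * sensitivity / epsilon, rng) >= noisy_threshold:
            return index  # halt after the first "above" report
    return None
```

In the standard analysis, the privacy cost of one such run is ε regardless of how many below-threshold queries were examined before it halted, which is why the technique suits repeated checks against private data.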
Free, publicly-accessible full text available June 17, 2026
DATAMORPHER: Automatic Data Transformation using LLM-Based Zero-Shot Code Generation

https://doi.org/10.1109/ICDE65448.2025.00346

Sharma, Ankita; Tandel, Jaykumar; Li, Xuanmao; Wang, Lanjun; Fariha, Anna; Zhang, Liang; Naqvi, Syed Arsalan Ahmed; Riaz, Irbaz Bin; Cao, Lei; Zou, Jia (May 2025, 2025 IEEE 41st International Conference on Data Engineering (ICDE))

Free, publicly-accessible full text available May 7, 2026
IDNet: A Novel Identity Document Dataset via Few-Shot and Quality-Driven Synthetic Data Generation

https://doi.org/10.1109/BigData62323.2024.10825017

Xie, Lulu; Wang, Yancheng; Guan, Hong; Nag, Soham; Goel, Rajeev; Swamy, Niranjan; Yang, Yingzhen; Xiao, Chaowei; Prisby, Jonathan; Maciejewski, Ross; et al (December 2024, IEEE)

Full Text Available
DeepMapping: Learned Data Mapping for Lossless Compression and Efficient Lookup

https://doi.org/10.1109/ICDE60146.2024.00008

Zhou, Lixi; Candan, K Selçuk; Zou, Jia (May 2024, IEEE)

Storing tabular data to balance storage and query efficiency is a long-standing research question in the database community. In this work, we argue and show that a novel DeepMapping abstraction, which relies on the impressive memorization capabilities of deep neural networks, can provide better storage cost, better latency, and better run-time memory footprint, all at the same time. Such unique properties may benefit a broad class of use cases in capacity-limited devices. Our proposed DeepMapping abstraction transforms a dataset into multiple key-value mappings and constructs a multi-tasking neural network model that outputs the corresponding values for a given input key. To deal with memorization errors, DeepMapping couples the learned neural network with a lightweight auxiliary data structure capable of correcting mistakes. The auxiliary structure design further enables DeepMapping to efficiently deal with insertions, deletions, and updates even without retraining the mapping. We propose a multi-task search strategy for selecting the hybrid DeepMapping structures (including model architecture and auxiliary structure) with a desirable trade-off among memorization capacity, size, and efficiency. Extensive experiments on a real-world dataset as well as synthetic and benchmark datasets, including TPC-H and TPC-DS, demonstrated that the DeepMapping approach balances retrieval speed and compression ratio better than several cutting-edge competitors.
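The coupling of a learned mapping with a correcting auxiliary structure can be pictured with a toy sketch. Everything below is an assumption made for illustration: `model` is any plain callable standing in for the neural network, the `keys` set is a simplified existence check, and `corrections` is an ordinary dictionary holding only the entries the model gets wrong. It is not the paper's data layout.

```python
class HybridMapping:
    """Toy key-value store: a fixed model plus a table of corrections."""

    def __init__(self, model, data):
        self.model = model        # stand-in for the trained network
        self.keys = set(data)     # which keys currently exist
        # Store only the entries the model fails to reproduce.
        self.corrections = {k: v for k, v in data.items() if model(k) != v}

    def get(self, key):
        if key not in self.keys:
            return None
        # A stored correction overrides the model's answer.
        return self.corrections.get(key, self.model(key))

    def put(self, key, value):
        # Insert or update without retraining the model.
        self.keys.add(key)
        if self.model(key) == value:
            self.corrections.pop(key, None)
        else:
            self.corrections[key] = value

    def delete(self, key):
        self.keys.discard(key)
        self.corrections.pop(key, None)
```

The better the model memorizes the data, the smaller `corrections` becomes, which is the intuition behind trading model size against auxiliary-structure size.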
Full Text Available
Collaborative large language models for automated data extraction in living systematic reviews

https://doi.org/10.1093/jamia/ocae325

Khan, Muhammad Ali; Ayub, Umair; Naqvi, Syed Arsalan Ahmed; Khakwani, Kaneez Zahra Rubab; Sipra, Zaryab bin Riaz; Raina, Ammad; Zhou, Sihan; He, Huan; Saeidi, Amir; Hasan, Bashar; et al (January 2025, Journal of the American Medical Informatics Association)

Objective: Data extraction from the published literature is the most laborious step in conducting living systematic reviews (LSRs). We aim to build a generalizable, automated data extraction workflow leveraging large language models (LLMs) that mimics the real-world 2-reviewer process. Materials and Methods: A dataset of 10 trials (22 publications) from a published LSR was used, focusing on 23 variables related to trial, population, and outcomes data. The dataset was split into prompt development (n = 5) and held-out test sets (n = 17). GPT-4-turbo and Claude-3-Opus were used for data extraction. Responses from the 2 LLMs were considered concordant if they were the same for a given variable. The discordant responses from each LLM were provided to the other LLM for cross-critique. Accuracy, i.e., the total number of correct responses divided by the total number of responses, was computed to assess performance. Results: In the prompt development set, 110 (96%) responses were concordant, achieving an accuracy of 0.99 against the gold standard. In the test set, 342 (87%) responses were concordant. The accuracy of the concordant responses was 0.94. The accuracy of the discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. Of the 49 discordant responses, 25 (51%) became concordant after cross-critique, increasing accuracy to 0.76. Discussion: Concordant responses by the LLMs are likely to be accurate. In instances of discordant responses, cross-critique can further increase the accuracy. Conclusion: Large language models, when simulated in a collaborative, 2-reviewer workflow, can extract data with reasonable performance, enabling truly “living” systematic reviews.
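The reported percentages can be cross-checked with a few lines of arithmetic. The sketch below assumes one response per publication-variable pair, which the abstract does not state outright but which matches its counts (342 concordant plus 49 discordant responses equals 17 × 23). The variable names are illustrative.

```python
VARIABLES = 23

# One response per (publication, variable) pair is an assumption.
dev_total = 5 * VARIABLES      # prompt development set: 115 responses
test_total = 17 * VARIABLES    # held-out test set: 391 responses

dev_concordant = 110
test_concordant = 342
test_discordant = test_total - test_concordant

dev_rate = dev_concordant / dev_total        # about 0.957
test_rate = test_concordant / test_total     # about 0.875
resolved_rate = 25 / test_discordant         # about 0.510

print(f"development concordance: {dev_rate:.1%}")
print(f"test concordance: {test_rate:.1%}")
print(f"resolved by cross-critique: {resolved_rate:.1%}")
```

Under that assumption the figures round to the 96%, 87%, and 51% quoted in the abstract.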
Free, publicly-accessible full text available January 21, 2026
A Comparison of End-to-End Decision Forest Inference Pipelines

https://doi.org/10.1145/3620678.3624656

Guan, Hong; Masood, Saif; Dwarampudi, Mahidhar; Gunda, Venkatesh; Min, Hong; Yu, Lei; Nag, Soham; Zou, Jia (October 2023, Proceedings of 2023 ACM Symposium on Cloud Computing (SoCC'23))

Decision forests, including RandomForest, XGBoost, and LightGBM, dominate machine learning tasks over tabular data. Recently, several frameworks were developed for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forests from Google, Hummingbird from Microsoft, Nvidia FIL, and lleaves. While these frameworks are fully optimized for inference computations, they are all decoupled from databases and general data management frameworks, which leads to cross-system performance overheads. We first provided a DICT model to understand the performance gaps between decoupled and in-database inference. We further identified that for in-database inference, in addition to the popular UDF-centric representation that encapsulates the ML model into one User Defined Function (UDF), there also exists a relation-centric representation that breaks down the decision forest inference into several fine-grained SQL operations. The relation-centric representation can achieve significantly better performance for large models. We optimized both implementations and conducted a comprehensive benchmark to compare them to the aforementioned decoupled inference pipelines and to existing in-database inference pipelines such as Spark-SQL and PostgresML. The evaluation results validated the DICT model and demonstrated the superior performance of our in-database inference design compared to the baselines.
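To make the relation-centric idea concrete, here is a minimal sketch in which the trees are rows in a table and inference is a recursive join, run on SQLite through Python's standard `sqlite3` module. The schema (`nodes`, `features`), the column names, and the averaging of leaf values are all assumptions chosen for illustration; the abstract does not describe the actual SQL operations the paper uses.

```python
import sqlite3

QUERY = """
WITH RECURSIVE walk(sample_id, tree_id, node_id) AS (
    -- Start every sample at the root (node 0) of every tree.
    SELECT s.sample_id, n.tree_id, n.node_id
    FROM (SELECT DISTINCT sample_id FROM features) AS s
    CROSS JOIN nodes AS n
    WHERE n.node_id = 0
    UNION ALL
    -- Step to the left or right child by comparing the feature value.
    -- Leaves have feature = NULL, so they match no row and the walk stops.
    SELECT w.sample_id, w.tree_id,
           CASE WHEN f.fvalue <= n.threshold THEN n.left_id ELSE n.right_id END
    FROM walk AS w
    JOIN nodes AS n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
    JOIN features AS f ON f.sample_id = w.sample_id AND f.feature = n.feature
)
SELECT w.sample_id, AVG(n.leaf_value)
FROM walk AS w
JOIN nodes AS n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
WHERE n.feature IS NULL
GROUP BY w.sample_id
ORDER BY w.sample_id
"""


def forest_predict(nodes, features):
    """Average the leaf values reached by each sample across all trees."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes (tree_id INT, node_id INT, feature INT, "
        "threshold REAL, left_id INT, right_id INT, leaf_value REAL)"
    )
    conn.execute("CREATE TABLE features (sample_id INT, feature INT, fvalue REAL)")
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?, ?)", nodes)
    conn.executemany("INSERT INTO features VALUES (?, ?, ?)", features)
    return conn.execute(QUERY).fetchall()
```

Because each traversal step is an ordinary join, the database can in principle partition and parallelize it like any other query, rather than treating the whole model as an opaque function call.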
Serving Deep Learning Models from Relational Databases

https://doi.org/10.48786/edbt.2024.61

Zhou, Lixi; Lin, Qi; Chowdhury, Kanchan; Masood, Saif; Eichenberger, Alexandre; Min, Hong; Sim, Alexander; Wang, Jie; Wang, Yida; Wu, Kesheng; et al (January 2024, OpenProceedings.org)

Serving deep learning (DL) models on relational data has become a critical requirement across diverse commercial and scientific domains, sparking growing interest recently. In this visionary paper, we embark on a comprehensive exploration of representative architectures to address the requirement. We highlight three pivotal paradigms. The state-of-the-art DL-centric architecture offloads DL computations to dedicated DL frameworks. The potential UDF-centric architecture encapsulates one or more tensor computations into User Defined Functions (UDFs) within the relational database management system (RDBMS). The potential relation-centric architecture aims to represent a large-scale tensor computation through relational operators. While each of these architectures demonstrates promise in specific use scenarios, we identify urgent requirements for seamless integration of these architectures and of the middle ground between them. We delve into the gaps that impede the integration and explore innovative strategies to close them. We present a pathway to establish a novel RDBMS for enabling a broad class of data-intensive DL inference applications.
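A UDF-centric design can be sketched in miniature with SQLite, whose Python binding lets a Python function be registered and then called from SQL. The example below is only an illustration under stated assumptions: the "model" is a hypothetical two-weight linear scorer, and `linear_score`, `WEIGHTS`, `BIAS`, and the `samples` table are made-up names. `sqlite3.Connection.create_function` is the real standard-library call.

```python
import sqlite3

# Hypothetical model parameters, standing in for a real DL model.
WEIGHTS = (0.5, -0.25)
BIAS = 0.1


def linear_score(x1, x2):
    # The whole "model" is encapsulated in one function.
    return WEIGHTS[0] * x1 + WEIGHTS[1] * x2 + BIAS


def score_rows(rows):
    """Load (id, x1, x2) rows and score them inside the database."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function as a 2-argument SQL function.
    conn.create_function("linear_score", 2, linear_score)
    conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    return conn.execute(
        "SELECT id, linear_score(x1, x2) FROM samples ORDER BY id"
    ).fetchall()
```

The query engine sees `linear_score` as a black box, which keeps the data in place but hides the computation from the optimizer; the relation-centric alternative exposes the computation as relational operators instead.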
Automatic Data Transformation Using Large Language Model - An Experimental Study on Building Energy Data

https://doi.org/10.1109/BigData59044.2023.10386931

Sharma, Ankita; Li, Xuanmao; Guan, Hong; Sun, Guoxin; Zhang, Liang; Wang, Lanjun; Wu, Kesheng; Cao, Lei; Zhu, Erkang; Sim, Alexander; et al (December 2023, Proceedings of 2023 IEEE International Conference on Big Data (IEEE BigData 2023))

Existing approaches to automatic data transformation are insufficient to meet the requirements in many real-world scenarios, such as the building sector. First, there is no convenient interface for domain experts to provide domain knowledge easily. Second, they require significant training data collection overheads. Third, the accuracy suffers from complicated schema changes. To address these shortcomings, we present a novel approach that leverages the unique capabilities of large language models (LLMs) in coding, complex reasoning, and zero-shot learning to generate SQL code that transforms the source datasets into the target datasets. We demonstrate the viability of this approach by designing an LLM-based framework, termed SQLMorpher, which comprises a prompt generator that integrates the initial prompt with optional domain knowledge and historical patterns in external databases. It also implements an iterative prompt optimization mechanism that automatically improves the prompt based on flaw detection. The key contributions of this work include (1) pioneering an end-to-end LLM-based solution for data transformation, (2) developing a benchmark dataset of 105 real-world building energy data transformation problems, and (3) conducting an extensive empirical evaluation where our approach achieved 96% accuracy across all 105 problems. SQLMorpher demonstrates the effectiveness of utilizing LLMs in complex, domain-specific challenges, highlighting their potential to drive sustainable solutions.
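The generate, check, and re-prompt loop described in the abstract might look roughly like the sketch below. It is a simplification under several assumptions: `llm` is a hypothetical callable that takes a prompt string and returns SQL text, flaw detection is reduced to catching SQL execution errors, and the prompt wording, `build_prompt`, and `transform` are invented for illustration. None of this is SQLMorpher's actual interface.

```python
import sqlite3


def build_prompt(source_schema, target_schema, domain_knowledge="", feedback=""):
    """Assemble the prompt from the task, optional hints, and prior errors."""
    parts = [
        "Write one SQLite SELECT statement that maps the source table "
        "to the target schema.",
        f"Source schema: {source_schema}",
        f"Target schema: {target_schema}",
    ]
    if domain_knowledge:
        parts.append(f"Domain knowledge: {domain_knowledge}")
    if feedback:
        parts.append(f"The previous attempt failed with: {feedback}")
    return "\n".join(parts)


def transform(conn, llm, source_schema, target_schema,
              domain_knowledge="", max_rounds=3):
    """Ask the (hypothetical) LLM for SQL, retrying with error feedback."""
    feedback = ""
    for _ in range(max_rounds):
        prompt = build_prompt(source_schema, target_schema,
                              domain_knowledge, feedback)
        sql = llm(prompt)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Simplified flaw detection: feed the database error back.
            feedback = str(exc)
    raise RuntimeError(f"no valid SQL after {max_rounds} rounds: {feedback}")
```

A fuller flaw detector would also compare the produced rows against the target schema or sample outputs, since SQL that executes cleanly can still be semantically wrong.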
Benchmark of DNN Model Search at Deployment Time

https://doi.org/10.1145/3538712.3538725

Zhou, Lixi; Jain, Arindam; Wang, Zijie; Das, Amitabh; Yang, Yingzhen; Zou, Jia (July 2022, SSDBM '22: Proceedings of the 34th International Conference on Scientific and Statistical Database Management)

Deep learning has become the most popular direction in machine learning and artificial intelligence. However, the preparation of training data, as well as model training, are often time-consuming and become the bottleneck of the end-to-end machine learning lifecycle. Reusing existing models for inference on a new dataset can avoid the costs of retraining. However, when there are multiple candidate models, it is challenging to discover the right model for reuse. Although there exist a number of model-sharing platforms such as ModelDB, TensorFlow Hub, PyTorch Hub, and DLHub, most of these systems require model uploaders to manually specify the details of each model and model downloaders to screen keyword search results for selecting a model. What is missing is a highly productive model search tool that selects models for deployment without the need for any manual inspection and/or labeled data from the target domain. This paper proposes multiple model search strategies including various similarity-based approaches and non-similarity-based approaches. We design, implement, and evaluate these approaches on multiple model inference scenarios, including activity recognition, image recognition, text classification, natural language processing, and entity matching. The experimental evaluation showed that our proposed asymmetric similarity-based measurement, adaptivity, outperformed symmetric similarity-based measurements and non-similarity-based measurements in most of the workloads.
Full Text Available
